Confusion with Deduplicating; No Longer Creating Additional Alerts on the Same Issue

justin.swall · February 15, 2022, 7:10pm

Something has changed recently that’s broken our use of PagerDuty and I don’t know how to fix it.

Previously, we’d generate alerts via e-mail and deduplicate them based on the Ticket number our system generates in the subject line. Anything between # and / would belong to the same incident/alert.
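To make the rule concrete, here is a minimal sketch of that extraction in Python. It only illustrates "anything between # and /"; the pattern, the function name `dedup_key`, and the sample subject lines are hypothetical and are not the actual PagerDuty integration settings.

```python
import re
from typing import Optional

# Hypothetical pattern: capture everything after the first "#"
# up to (but not including) the next "/".
TICKET_KEY = re.compile(r"#([^/]+)/")


def dedup_key(subject: str) -> Optional[str]:
    """Return the ticket number found between '#' and '/', or None."""
    match = TICKET_KEY.search(subject)
    if match is None:
        return None
    # Strip stray whitespace so "#12345 /" and "#12345/" give the same key.
    return match.group(1).strip()


if __name__ == "__main__":
    # Made-up subject lines, for illustration only.
    print(dedup_key("Alert: Ticket #12345/ Disk failure"))
    print(dedup_key("Alert: Ticket #12345/ Still failing"))
```

Two e-mails about the same ticket produce the same key, so they would be grouped together, while a subject line without the `#.../` marker produces no key at all.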

(FYI - I’m using those two words, alert and incident, interchangeably as we’ve been with PagerDuty since before these were separate things. I’m still not totally sure I understand what the difference is between these two things as it applies to our use of PagerDuty and this might be the source of our problems.)

So, as I was saying, any e-mail with the same ticket number in the subject line would be deduplicated. This matters because our system sends out e-mail alerts about the same ticket every 10 minutes until the issue is addressed in our system. Deduplication allowed PagerDuty to keep only one incident/alert in the system and generate only one page.

Once the issue was addressed in our system, an e-mail would kick off to PagerDuty closing out the incident/alert and it would be shown as resolved in the PD system.

If, however, the same ticket number later entered a failed state and started generating alerts again, a new incident/alert would be created in PagerDuty and a new page sent. Again, additional messages from our system about the same issue would be deduplicated until it was addressed in our system and the “all clear” was sent to PagerDuty to close out the incident/alert.
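The lifecycle described above can be sketched as a toy model in Python. This is only a model of the behavior we expect, not how PagerDuty is implemented; the class name `IncidentTracker`, the action names, and the return strings are all made up for illustration.

```python
from typing import Set


class IncidentTracker:
    """Toy model of the expected behavior: one open incident per ticket key."""

    def __init__(self) -> None:
        # Ticket keys that currently have an open (unresolved) incident.
        self.open_keys: Set[str] = set()

    def handle(self, key: str, action: str) -> str:
        if action == "trigger":
            if key in self.open_keys:
                # Repeat e-mail for a ticket that is already open: no new page.
                return "deduplicated"
            # First e-mail, or first e-mail after a resolve: open and page.
            self.open_keys.add(key)
            return "new incident"
        if action == "resolve":
            if key in self.open_keys:
                self.open_keys.remove(key)
                return "resolved"
            # "All clear" for a ticket with no open incident.
            return "ignored"
        raise ValueError(f"unknown action: {action}")


if __name__ == "__main__":
    tracker = IncidentTracker()
    for action in ["trigger", "trigger", "resolve", "trigger"]:
        print(action, "->", tracker.handle("12345", action))
```

The last step is the important one: a trigger for the same ticket after a resolve is expected to open a new incident and page again, rather than being dropped.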

The problem is that the second part, a new incident/alert after a resolve, is no longer happening.

Once a ticket generates an alert/incident in PagerDuty and it gets resolved, if that same ticket goes back into a failed state again at a future date and the e-mail with the alert gets sent off to PagerDuty, PagerDuty does nothing with that e-mail. It doesn’t generate another incident/alert and we don’t get paged.

This is obviously horrible as this means we’re missing important pages.

I’m not sure when this problem started happening. As pages were still coming in it seemed like it was working, but at least once or twice something that should have generated a page didn’t generate one and I just figured it might be a fluke. The same thing happened a third time this morning so I started doing testing and figured out the pattern mentioned above.

Does anyone have any idea what we might need to change to work inside this new incident/alert system? I’m assuming that’s somehow related to why this is no longer triggering a new alert/page… but I could be wrong. I’m just guessing and grasping at straws here.

Thanks for any help or ideas you might have.

– Justin Swall

dmcclure · February 16, 2022, 2:11pm

A few questions to get some more context:

Are you using an event orchestration or event ruleset email integration, or a service-level email integration?
Are you using regex to extract the ticket ID from your email subject and using that for your dedup key, or are you just using the entire subject line as-is?
Is the subject line changing at all between the first trigger email and follow-on emails? (For example, do follow-on emails have “Re:” or “Fwd:” or anything different? This assumes you’re using the full subject line.)
Are you defining certain conditions to map into a unique “trigger” and “resolve” action in your rules configuration?

justin.swall · February 16, 2022, 3:30pm

Thanks for the reply @dmcclure

Our setup is very simple since it’s been humming along for years before things like Event Orchestration were part of the platform. Essentially, we just have a couple of Services defined. Each service has only three settings enabled - the On Call Now, the Escalation Policy, and an Email Integration.

Each Email Integration is configured identically to the one shown below. The only reason we even have multiple services defined is that it then shows the impacted service on the notification in the PagerDuty app. We can then route the various types of alerts to the various e-mail addresses tied to the various services and get a bit more clarity on what type of alert is coming in.

Let me know if there’s any additional information I can provide. Thanks!

dmcclure · February 16, 2022, 6:02pm

Are dedup/incident keys getting created as expected with the value between # and / ? If so, can you manually send in an email that should trigger and one that should resolve and have the incident created and resolved as expected?

I might also try creating a new service and email integration and test with manually sending an email just to sanity check if it works or not with a new service/email address.

justin.swall · February 17, 2022, 2:22am

The plot thickens.

I’ve done some additional testing this evening and when I do my manual test it is all working correctly as expected. I do still have a specific e-mail that I can confirm was sent that never kicked off a page, so I’ve reported the issue to support.

Doesn’t seem like this will be a forum-solvable problem. Thanks for all of your help though!